For the past few weeks, the Camp Fire in Northern California has been a deadly and destructive force, receiving enormous media coverage. The wildfire, which began on November 8, 2018, has left 85 people dead and 249 missing, and has destroyed thousands of buildings. It has also caused major air quality problems, forcing many to wear masks or stay indoors to protect their health. But this is not the first time a major wildfire has ravaged the west coast of the United States.
In this tutorial, we will analyze west coast wildfire data from 2010 to 2016, available through the University of California Irvine Data Science Initiative GitHub organization. This data was obtained from the NASA Moderate Resolution Imaging Spectroradiometer (MODIS) satellite. We will begin by loading the data and performing exploratory data analysis before diving deeper into regression and classification using basic machine learning methods. These data science methods will allow us to look at past trends and make inferences about future west coast wildfires.

We will begin by importing Python packages that will help us to load and wrangle the data:
- pandas: a package that allows us to build data frames, a tabular data format that can be thought of like a spreadsheet
- numpy: a package containing matrix and vector mathematical functions

import pandas as pd
import numpy as np
The wildfire dataset is contained in a comma-separated values (CSV) file located on GitHub. Pandas has a read_csv function that allows us to load a data frame from a CSV over the web. We will specify the data type of each column using a Python dict to ensure that pandas loads the data using the types we want for our analysis.
data_url = 'https://raw.githubusercontent.com/UCIDataScienceInitiative/Climate_Hackathon/master/west_coast_fires/fires.csv'
dtype = {
    'confidence': float,
    'day': int,
    'frp': float,
    'lat': float,
    'lon': float,
    'month': int,
    'year': int,
    'x': float,
    'y': float,
    'dayofyear': int,
    'vpd': float,
    'temp': float,
    'humidity': float
}
df = pd.read_csv(data_url, index_col=0, dtype=dtype)
df.head()
df.shape
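Pandas applies the dtype mapping at parse time. The mechanism can be sketched on a small in-memory CSV (toy values, not from the wildfire dataset):

```python
from io import StringIO

import numpy as np
import pandas as pd

# A toy CSV mimicking the shape of the wildfire data (values invented)
toy_csv = "id,year,month,frp\n0,2010,6,12.5\n1,2011,7,3.0\n"

toy_dtype = {'year': int, 'month': int, 'frp': float}
toy_df = pd.read_csv(StringIO(toy_csv), index_col=0, dtype=toy_dtype)

print(toy_df.dtypes)
print(toy_df.shape)  # (2, 3) -- the id column became the index
```

Columns listed in the dtype dict are parsed directly into the requested types, which avoids a second conversion pass after loading.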
The next step is to determine what this wildfire data means so that we can explore it. Luckily, the README for the GitHub repository provides a description of each column:
- id: unique identifier of the fire in the overall context of the world dataset
- confidence: how much confidence the satellite has that this is actually a fire detection (percent)
- day: the day of the month
- frp: Fire Radiative Power, the strength of the fire
- lat: latitude
- lon: longitude
- month: month of the year
- year: year
- x: x position in a uniformly-spaced grid
- y: y position in a uniformly-spaced grid
- dayofyear: day of the year (from 0 to 364, or 365 for leap years)
- vpd: Vapor Pressure Deficit, the difference between the moisture in the air and the amount of moisture the air could hold
- temp: temperature (degrees Kelvin)
- humidity: humidity (percent)

To begin our exploration, we can look at the distributions of some of our variables.
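Note that the temp column is reported in Kelvin. A tiny helper (purely illustrative, not part of the dataset or the tutorial code) converts to Celsius when a more familiar scale helps with interpretation:

```python
# Kelvin and Celsius differ only by a constant offset
def kelvin_to_celsius(temp_k):
    return temp_k - 273.15

kelvin_to_celsius(300.0)  # 26.85, a warm summer day
```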
To visualize data, we can load helpful Python visualization packages:
- matplotlib
- seaborn

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib import cm
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
import seaborn as sns
# Set matplotlib notebook configuration options
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
First we will look at the distribution of the confidence variable. This variable represents how confident we are that the satellite has correctly classified a specific image as a wildfire. This distribution will give us some insight into how much we can trust the results of our analysis.
_ = plt.title("Distribution of Confidence")
_ = plt.xlabel("Confidence")
_ = plt.ylabel("Count")
_ = plt.xlim(0.0, 1.0)
_ = plt.hist(df['confidence'].values)
This distribution of confidence is left-skewed, with its mode between 0.9 and 1.0. Based on this distribution, it appears that the majority of our data points truly represent wildfires, giving us some confidence in our analysis.
To be extra cautious, we can restrict our analysis to those fires with high confidence.
confidence_threshold = 0.8
df = df.loc[df['confidence'] > confidence_threshold]
df.shape
Next, we can look at the Fire Radiative Power variable's distribution to get a sense of how strong fires were between 2010 and 2016.
_ = plt.title("Distribution of Fire Radiative Power")
_ = plt.xlabel("Fire Radiative Power")
_ = plt.ylabel("Count")
_ = plt.hist(df['frp'].values)
This plot suggests that many of our data points have near-zero values for the Fire Radiative Power variable. We will exclude these weakest detections by requiring a strength above a small threshold:
frp_threshold = 1
df = df.loc[df['frp'] > frp_threshold]
df.shape
_ = plt.title("Distribution of Fire Radiative Power")
_ = plt.xlabel("Fire Radiative Power")
_ = plt.ylabel("Count")
_ = plt.hist(df['frp'].values)
This plot is not very helpful. We can tell that the overwhelming majority of our data points come from fires with low FRP values, but there are clearly some outliers that are not visible on this plot. Log-scaling the y-axis should make it more informative.
_ = plt.title("Distribution of Fire Radiative Power")
_ = plt.xlabel("Fire Radiative Power")
_ = plt.ylabel("log(Count)")
_ = plt.hist(df['frp'].values, log=True)
We can create these same histograms for the vpd, temp, and humidity variables:
_ = plt.title("Distribution of Vapor Pressure Deficit")
_ = plt.xlabel("Vapor Pressure Deficit")
_ = plt.ylabel("Count")
_ = plt.hist(df['vpd'].values)
_ = plt.title("Distribution of Temperature")
_ = plt.xlabel("Temperature (degrees Kelvin)")
_ = plt.ylabel("Count")
_ = plt.hist(df['temp'].values)
_ = plt.title("Distribution of humidity")
_ = plt.xlabel("Humidity (percent)")
_ = plt.ylabel("Count")
_ = plt.hist(df['humidity'].values)
Next, we can look at the distribution over time using the year variable. To create this histogram we will want to bin by year rather than allowing matplotlib to choose the bins automatically, so we can use pandas groupby to get the counts per year explicitly:
year_df = df.groupby(['year']).size().reset_index(name='count')
year_df.head()
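On a toy frame (invented values), the groupby/size/reset_index pattern produces a tidy count table with one row per group:

```python
import pandas as pd

# Toy frame demonstrating the counting pattern used above
toy = pd.DataFrame({'year': [2010, 2010, 2011, 2011, 2011]})
toy_counts = toy.groupby(['year']).size().reset_index(name='count')
print(toy_counts)  # year 2010 -> count 2, year 2011 -> count 3
```

The name='count' argument labels the new column holding the group sizes, which would otherwise be unnamed.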
Note that now that we have the explicit counts, we can use matplotlib's bar function rather than hist.
_ = plt.title("Distribution of Year")
_ = plt.xlabel("Year")
_ = plt.ylabel("Count")
_ = plt.bar(year_df['year'].values, year_df['count'].values, 0.65)
Next, we can look at the distribution over the year using the dayofyear variable.
_ = plt.title("Distribution of Day of Year")
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Count")
_ = plt.hist(df['dayofyear'].values)
This plot shows that our distribution is fairly symmetric and unimodal, centered around day 225 of the year. But perhaps the distribution will be clearer if we view this data by month rather than by day.
We will need to import Python's datetime module to help with date formatting.
import datetime
datetime can be used to format our month numbers into month names: we create a date object from the month number, then format it with an output date string format that consists only of the month name.
def month_num_to_name(month_num):
    date = datetime.date(2018, month_num, 1)
    return date.strftime('%B')
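A quick sanity check of the helper (repeated here so the snippet runs on its own; the names produced by '%B' follow the active locale, English under the default C locale):

```python
import datetime

def month_num_to_name(month_num):
    # Any year and day work here; only the month affects '%B'
    date = datetime.date(2018, month_num, 1)
    return date.strftime('%B')

month_num_to_name(1)   # 'January'
month_num_to_name(11)  # 'November'
```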
We will obtain counts of fires per month using pandas groupby.
month_df = df.groupby(['month']).size().reset_index(name='count')
month_df['month'] = month_df['month'].apply(month_num_to_name)
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Distribution of Date")
_ = plt.xlabel("Month of Year")
_ = plt.ylabel("Count")
bar_width = 0.65
_ = plt.bar(month_df['month'].values, month_df['count'].values, bar_width)
Finally, we can look at the entire span of time from 2010 to 2016 and count the fires observed in each month.
year_month_df = df.groupby(['year', 'month']).size().reset_index(name='count')
year_month_df['year-month'] = year_month_df.apply(lambda row: ("%s %i" % (month_num_to_name(row['month']), row['year'])), axis='columns')
year_month_df.head()
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Distribution of Year and Month")
_ = plt.xlabel("Year and Month")
_ = plt.ylabel("Count")
_ = plt.bar(year_month_df['year-month'].values, year_month_df['count'].values, 0.65)
_ = plt.xticks(year_month_df['year-month'].values[::3], rotation=70)
Because this wildfire data also contains location data, we can view it on a map as well. We will begin our multivariate analysis by looking at the location data in relationship to other variables.
To do so, we will load a Python package to assist with geographical data visualization:
- cartopy: a geographical visualization toolkit for matplotlib

import cartopy.crs as ccrs
from cartopy.io.img_tiles import Stamen
imagery = Stamen(style='terrain-background')
To determine the extent (ranges) of our maps, we can determine the range of the latitude and longitude variables in the wildfire dataset.
min_lon = df['lon'].min()
max_lon = df['lon'].max()
min_lon -= (max_lon - min_lon) * 0.2
min_lat = df['lat'].min()
max_lat = df['lat'].max()
For our visualization, we will plot fires from different years using different colors. To do so, matplotlib's colormap module can be used to map each year to a color of the rainbow.
We will make this convenient for ourselves later by creating a function that generalizes this mapping to any sorted array.
def get_cmap_dict(vals, cmap_name='rainbow', numeric=True):
    cmap = cm.get_cmap(cmap_name)
    if numeric:
        # assume vals already sorted
        min_val = vals[0]
        max_val = vals[-1]
        return dict(zip(vals, [cmap((val - min_val) / (max_val - min_val)) for val in vals]))
    else:
        len_vals = len(vals)
        return dict(zip(vals, [cmap(val / len_vals) for val in range(len_vals)]))
years = list(df['year'].unique())
year_to_color_map = get_cmap_dict(years)
Because the plot could become cluttered and difficult to interpret, we will first restrict the fires to the strongest (highest frp) of each year.
n_per_year = 30
df_top_frp = df.sort_values(['frp'], ascending=False)
df_top_frp_year_groupby = df_top_frp.groupby(['year'])
Next we will need to determine the minimum and maximum strength values, which will help us to make a legend.
min_frp = df_top_frp_year_groupby.min()['frp'].min()
max_frp = df_top_frp_year_groupby.max()['frp'].max()
frp_intervals = np.linspace(max_frp, min_frp, 4, endpoint=False)
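With endpoint=False, np.linspace steps from the first argument toward (but never reaching) the second, here in descending order, so every legend interval stays within the observed frp range. On toy numbers:

```python
import numpy as np

# Four evenly spaced values starting at 100.0, stepping toward
# (but excluding) 20.0: step = (20 - 100) / 4 = -20
intervals = np.linspace(100.0, 20.0, 4, endpoint=False)
print(intervals)  # [100.  80.  60.  40.]
```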
Now we will create the plot.
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(1, 1, 1, projection=imagery.crs)
_ = ax.set_extent([min_lon, max_lon, min_lat, max_lat], ccrs.PlateCarree())
_ = ax.add_image(imagery, 8)
frp_scaling_factor = 0.003
for year, df_top_frp_year in df_top_frp_year_groupby:
    for index, row in df_top_frp_year.head(n_per_year).iterrows():
        _ = plt.plot(
            row['lon'],
            row['lat'],
            marker='o',
            color=year_to_color_map[row['year']],
            markersize=(frp_scaling_factor*row['frp']),
            alpha=0.5,
            transform=ccrs.Geodetic()
        )
_ = ax.set_title("West Coast Wildfires 2010-2016, Strongest %i per Year" % n_per_year)
year_legend_elements = [Patch(facecolor=year_color, label=year) for year, year_color in year_to_color_map.items()]
year_legend = ax.legend(handles=year_legend_elements, loc='upper left', title='Year')
_ = plt.gca().add_artist(year_legend)
frp_legend_elements = [Line2D([0], [0], marker='o', color=(0,0,0,0), label=np.floor(frp_val), markerfacecolor='#C0C0C0', markersize=frp_val*frp_scaling_factor) for frp_val in frp_intervals]
_ = ax.legend(handles=frp_legend_elements, loc='lower left', labelspacing=2, columnspacing=2, title='Fire Radiative Power')
Next we can perform multivariate analysis of the rest of our data and ask ourselves whether we see relationships between any of our variables.
In the following plot we will look at temperature over time.
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Temperature by Day of Year")
_ = sns.scatterplot(x="dayofyear", y="temp", data=df)
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Temperature (kelvin)")
Next, we can incorporate year by coloring each point based on year.
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Temperature by Day of Year")
_ = sns.scatterplot(x="dayofyear", y="temp", hue="year", data=df, linewidth=0)
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Temperature (kelvin)")
It might be easier to look at the means for each year.
year_dayofyear_df = df.groupby(['year', 'dayofyear']).mean().reset_index()
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Mean Temperature by Day of Year")
_ = sns.scatterplot(x="dayofyear", y="temp", hue="year", data=year_dayofyear_df, linewidth=0)
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Mean Temperature (kelvin)")
year_month_mean_df = df.groupby(['year', 'month']).mean().reset_index()
month_tick_formatter = matplotlib.ticker.FuncFormatter(lambda x, p: month_num_to_name(int(x)) if x >= 1 and x <= 12 else "")
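Because FuncFormatter instances are callable, the formatter can be checked directly before handing it to an axis (the month-name helper is repeated here so the snippet is self-contained; English names assume the default C locale):

```python
import datetime
import matplotlib.ticker

def month_num_to_name(month_num):
    return datetime.date(2018, month_num, 1).strftime('%B')

# Tick positions outside 1-12 are rendered as empty strings
fmt = matplotlib.ticker.FuncFormatter(
    lambda x, p: month_num_to_name(int(x)) if x >= 1 and x <= 12 else "")

fmt(3, None)   # 'March'
fmt(0.5, None) # '' -- out-of-range tick is blanked
```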
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Mean Temperature by Month")
ax = sns.lineplot(x="month", y="temp", hue="year", data=year_month_mean_df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Mean Temperature (kelvin)")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
Because we have many more variable pairs to evaluate, we can start to look at correlations.
corr_df = df.corr()
_ = plt.figure(figsize=(11, 9))
_ = sns.heatmap(corr_df, annot=True, vmin=-1, vmax=1)
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Vapor Pressure Deficit by Temperature")
_ = sns.scatterplot(x="temp", y="vpd", data=df)
_ = plt.xlabel("Temperature (kelvin)")
_ = plt.ylabel("Vapor Pressure Deficit")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Vapor Pressure Deficit by Humidity")
_ = sns.scatterplot(x="humidity", y="vpd", data=df)
_ = plt.xlabel("Humidity")
_ = plt.ylabel("Vapor Pressure Deficit")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Temperature")
ax = sns.scatterplot(x="temp", y="frp", data=df)
_ = plt.xlabel("Temperature (kelvin)")
_ = plt.ylabel("Fire Radiative Power")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Humidity by Temperature")
_ = sns.scatterplot(x="temp", y="humidity", data=df)
_ = plt.xlabel("Temperature (kelvin)")
_ = plt.ylabel("Humidity")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Humidity")
_ = sns.scatterplot(x="humidity", y="frp", data=df)
_ = plt.xlabel("Humidity")
_ = plt.ylabel("Fire Radiative Power")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Vapor Pressure Deficit")
_ = sns.scatterplot(x="vpd", y="frp", data=df)
_ = plt.xlabel("Vapor Pressure Deficit")
_ = plt.ylabel("Fire Radiative Power")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Confidence")
_ = sns.scatterplot(x="confidence", y="frp", data=df)
_ = plt.xlabel("Confidence")
_ = plt.ylabel("Fire Radiative Power")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Month")
ax = sns.scatterplot(x="month", y="frp", data=df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Fire Radiative Power")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Humidity by Month")
ax = sns.scatterplot(x="month", y="humidity", data=df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Humidity")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Vapor Pressure Deficit by Month")
ax = sns.scatterplot(x="month", y="vpd", data=df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Vapor Pressure Deficit")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
sns.lmplot(x="temp", y="frp", data=df);